Lab Notebook


Note: The science behind sequence matching

Analyzing a newly isolated DNA molecule by comparing its sequence with all other known sequences in search of a match involves database searching and sequence alignment. The ultimate goal of this type of analysis is to determine whether the new sequence bears a significant degree of similarity to another known sequence, which may in turn indicate homology (descent from a common ancestral sequence).

Perhaps the most important requirement for sequence matching is access to a comprehensive and up-to-date sequence database. GenBank, the EMBL nucleotide sequence database, and the DNA Database of Japan (DDBJ) are three partners in a longstanding collaboration to collect all publicly available sequence data. Sites in Bethesda, Maryland (USA), Hinxton (UK), and Mishima (Japan) exchange new sequence data and updates over the Internet every day and make the information immediately available to everyone by e-mail, anonymous ftp, and the World Wide Web.

Next in importance is the computer program used to search the database. Several different programs can compare two sequences with each other and determine the degree of similarity between them. The BLAST programs (BLAST stands for Basic Local Alignment Search Tool) are among the most popular. They offer a good combination of speed, sensitivity, flexibility, and statistical rigor. (See interpreting BLAST search results.)
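To make the idea of "local alignment" concrete, here is a minimal sketch of the Smith-Waterman dynamic-programming score in Python. This is not how BLAST itself works: BLAST uses faster heuristics to approximate this kind of exhaustive local alignment. The function name and the scoring values (match, mismatch, gap) are illustrative assumptions, not BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best Smith-Waterman local alignment score of a and b.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared stretch "ACGT" dominates the score; the flanks do not match.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))  # 8
```

Because the score is never allowed to fall below zero, unrelated flanking sequence does not penalize a well-matching internal region, which is what makes the alignment "local".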

In what situations would a scientist search sequence databases? As an example, sequence matching can be used to determine whether a newly identified DNA sequence is part of a known gene. In the simplest scenario, if a new sequence is identical or almost identical (except for a few nucleotide changes) to that of a gene in the sequence database, it is reasonable to conclude that the new sequence is either part of the same gene or of a closely related gene. But what if two sequences that appear to be different share sections that are identical? How do you know whether the identical sections are due to chance or indicate some meaningful relationship between the two sequences? Sequence analysis using BLAST or another program provides a "similarity score" to help answer this question.
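The question of chance versus meaningful matches can be illustrated with a rough back-of-the-envelope calculation. The sketch below estimates how many times a specific short stretch of DNA would be expected to appear in a database purely by chance. It assumes, as a simplification, that all four bases are equally likely and independent at every position; the statistics BLAST actually reports are based on alignment scores and are more sophisticated. The function name is hypothetical.

```python
def expected_chance_matches(word_length, database_length, base_prob=0.25):
    """Expected number of exact chance occurrences of one specific word.

    Simplifying assumption: every base occurs independently with
    probability base_prob (0.25 for four equally likely bases).
    """
    if word_length <= 0 or database_length < word_length:
        return 0.0
    # Number of places in the database where the word could start.
    positions = database_length - word_length + 1
    return positions * base_prob ** word_length


# Compare a 10-base and a 20-base identical section in a billion-base database.
for k in (10, 20):
    print(k, expected_chance_matches(k, 10**9))
```

Under these assumptions, a 10-base identical section is expected to turn up roughly a thousand times by chance in a billion-base database, while a 20-base identical section is expected far less than once, which is why longer shared sections are more convincing evidence of a real relationship.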

If the function of a particular DNA sequence is already known (for example, the 16S rRNA gene we have been working with in this lab), comparing its sequence with that of the same gene from another species of bacteria provides information about the evolutionary relationship between the two bacterial species. The assumption here is that the number of positions at which the nucleotide sequences differ is proportional to the time elapsed since the two species diverged from a common ancestor.
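Counting the positions that differ can be sketched in a few lines of Python. The example assumes the two sequences have already been aligned to the same length, with "-" marking gaps, and it skips gapped positions, which is one common convention rather than the only one. The result is the raw, uncorrected proportion of differing sites; the function name is hypothetical.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of compared positions at which two aligned sequences differ.

    Assumes seq_a and seq_b are already aligned (equal length);
    positions where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue  # gaps carry no base-to-base comparison
        compared += 1
        if x != y:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# One difference out of four compared positions.
print(p_distance("ACGT", "ACGA"))  # 0.25
```

Comparing this value across several pairs of species gives a simple ranking of which pairs are more closely related, under the proportionality assumption described above.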

However, not all DNA sequences change at a constant rate over time. For example, it is not at all clear whether all organisms experience similar mutation rates from purely environmental factors (increased UV exposure, for example). If the DNA sequence has, or once had, a functional role, selection (whose strength may depend on population size, among other things) can affect its rate of change. In some cases, mutations take the form of deletions, insertions, and substitutions of long stretches of DNA rather than single nucleotide changes. Finally, some sequences of DNA encode proteins with very specific structural requirements, and any change may prove unfavorable to the organism. Such sequences therefore do not tolerate change well and tend to remain the same for long periods of time. These are referred to as "conserved" regions. In contrast, sequences that can accommodate change more easily are referred to as "variable" regions.
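One simple way to see conserved and variable regions is to line up the same gene from several species and ask, position by position, how often the most common base occurs. The Python sketch below does this for sequences that are assumed to be already aligned to the same length. The function name and the toy sequences are made up for illustration and are not real 16S rRNA data.

```python
from collections import Counter


def column_conservation(aligned_seqs):
    """Fraction of sequences sharing the most common base at each position.

    Assumes the sequences are already aligned (all the same length).
    A value of 1.0 means the position is identical in every sequence.
    """
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("need at least one sequence, all of equal length")
    scores = []
    for column in zip(*aligned_seqs):
        # Count of the single most frequent base in this column.
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / len(aligned_seqs))
    return scores


# Toy alignment: the first two positions are conserved, the last two vary.
toy = ["ACGT", "ACGA", "ACTA"]
print(column_conservation(toy))
```

Runs of positions scoring at or near 1.0 correspond to conserved regions, while runs of lower scores correspond to variable regions.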
